Throughout the COVID-19 pandemic, the U.S. has seen multiple “waves” of increases in case incidence numbers, occurring at different times in different areas of the country, and with even greater variation on the state and county level. It has been difficult not only for public health officials and other policy makers to anticipate when case incidences might rise in their area, but also for researchers who are forecasting the pandemic to accurately predict future case numbers. Part of the challenge in guiding decision making has been understanding how to leverage the combination of mobility, public health, and survey signals to capture the complex dynamics and transmission of COVID-19.
Our directive for this project specifically instructed us to analyze Delphi’s signals as potential leading indicators of significant rises in cases at the county level across multiple distinct periods of time. The goal is to provide greater insight into the value of Delphi’s signals in predicting future increases in cases.
Initial exploration: Cross Correlation
Our first approach to this DAP looked at the relationship between the indicator and the signal more generally. We first used cross correlation analysis on the time series to identify the relationship between indicators and cases across a time period. For two time series \(y, x \in \mathbb{R}^T\), cross-correlation is defined as:
\[\max_{i} \mathrm{Corr}(y_{i+1,\cdots, T}, x_{1, \cdots, T-i}),\] and measures the maximum Pearson correlation between the two series when one is lagged by \(i\) days.
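As a rough illustration, the lag search can be written as follows (a Python sketch, not part of our R pipeline; the toy series and the `max_lag` cap are illustrative):

```python
import numpy as np

def cross_correlation(y, x, max_lag=30):
    """Return the lag i that maximizes the Pearson correlation
    between y[i:] and x[:len(x)-i], along with that correlation."""
    best_lag, best_corr = 0, -np.inf
    for i in range(max_lag + 1):
        yi, xi = y[i:], x[:len(x) - i]
        corr = np.corrcoef(yi, xi)[0, 1]
        if corr > best_corr:
            best_lag, best_corr = i, corr
    return best_lag, best_corr

# Toy example: "cases" is the "indicator" shifted forward by 5 days.
rng = np.random.default_rng(0)
indicator = np.sin(np.linspace(0, 6, 120)) + 0.05 * rng.normal(size=120)
cases = np.roll(indicator, 5)
# Trim the wrapped-around head before comparing.
lag, corr = cross_correlation(cases[5:], indicator[5:], max_lag=14)
```

Here the search recovers the 5-day shift as the optimal lag, with a correlation near 1.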
We calculated the cross-correlation and the optimal lag in each county. An example of this data over all observed counties for the Drs Visits indicator signal:
To ascertain the leadingness of an indicator, we want to determine whether the indicator began to rise significantly before cases began to rise significantly.
As a core component of this analysis, we need methodology to accurately identify significant rises, given a single time series of an indicator. This is non-trivial, since the data is quite noisy at the county level, and clean rise/drops are rare.
Starting at the Peak
At first we experimented with finding the peak of a signal in a given time period and identifying the closest local minimum that precedes the peak. However, this is not always the point at which the signal actually begins to rise (it could be caught in a shallow local minimum), and it does not guarantee that the rise will be a lengthy or a steep one. This method also picks only one rise period per signal per county for the given time period, which is not always reflective of the signal's actual behavior.
Best Fit Line
One option we tried was calculating a line of best fit for the signal over fixed windows within a larger time period: for example, fitting a line to every 21-day window within a 3-month period and choosing the window with the highest slope as the most significant rise period in that county for that signal.
Example plot
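A minimal sketch of this sliding-window fit (in Python with toy data; our actual pipeline is in R, and the window length here is illustrative):

```python
import numpy as np

def steepest_window(signal, window=21):
    """Fit a least-squares line to every `window`-day span and return
    the start index and slope of the span with the largest slope."""
    best_start, best_slope = 0, -np.inf
    t = np.arange(window)
    for start in range(len(signal) - window + 1):
        slope = np.polyfit(t, signal[start:start + window], 1)[0]
        if slope > best_slope:
            best_start, best_slope = start, slope
    return best_start, best_slope

# Toy series: flat for 40 days, then a steady unit-slope ramp.
sig = np.concatenate([np.zeros(40), np.arange(40, dtype=float)])
start, slope = steepest_window(sig, window=21)
```

Windows lying entirely on the ramp attain the maximal slope of 1, so the chosen window begins at or after day 40.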
Estimated Derivative
We then tried several derivative-estimation methods to identify periods where the estimated derivative at each point exceeds a certain threshold.
Example plot
We saw that smoothing the signal first using smoothing splines (in addition to the 7-day average smoothing already applied to the data, e.g. 7-day average CLI) and then applying the derivative method produced the best results. Tweaking this method with some additional decision rules gave us our best outcome for finding periods of significant rise.
Final criteria for rise periods: a period is a significant rise in a smoothed signal if

- The first derivative at every point is > 0, meaning the signal is in fact rising on every day.
- The period is longer than a certain number of days (for this analysis we used TODO), so that the rise is not spurious.
- Each first derivative is greater than a certain percentage of the other derivatives in the time period. (For this analysis we set this to 0%, effectively not using this parameter.) If set above 0%, this requires the rise to be a significant one for the county, but it also ties the decision to the specific time period under consideration: the identified rise points can change with the time period.
- The magnitude of the increase from the start to the end of the period is greater than a certain threshold (for this analysis we used TODO). This is another way to ensure that the rise is significant, not just a slight uptick in cases.
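A simplified Python sketch of these criteria (assuming the signal has already been smoothed; `min_days` and `min_increase` stand in for the TODO thresholds, and the derivative-percentile rule is omitted since we set it to 0%):

```python
import numpy as np

def rise_periods(smoothed, min_days=7, min_increase=0.2):
    """Find maximal runs of days where the first difference is positive,
    then keep runs that are long enough and rise by a large enough
    fraction overall. Thresholds here are placeholders."""
    diff = np.diff(smoothed)
    periods, start = [], None
    for t, d in enumerate(diff):
        if d > 0 and start is None:
            start = t           # a run of rising days begins
        elif d <= 0 and start is not None:
            periods.append((start, t))
            start = None
    if start is not None:
        periods.append((start, len(diff)))
    return [
        (s, e) for s, e in periods
        if e - s >= min_days
        and smoothed[e] - smoothed[s] >= min_increase * max(smoothed[s], 1e-9)
    ]

# Toy signal: flat, a 10-day climb, then a plateau.
sig = np.concatenate([np.ones(20), 1 + np.linspace(0.1, 1.0, 10), np.full(10, 2.0)])
periods = rise_periods(sig)  # a single rise period spanning the climb
```

The start index of each returned period is the estimated point where the signal begins to rise significantly.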
Finally, we take the point at the beginning of each rise period as the best estimation of a point of inflection where a signal begins to rise significantly, so we can address the question: Does the beginning of a rise in the indicator come before the beginning of a rise in cases?
In our final analysis, we look at this on a county-by-county basis over a continuous time period from 3/15/20 to 5/21/21.
In our analysis, we include all counties that average more than 20 cases a day, have indicator data for at least 90% of the days in the given time range, and have no zero or negative values for either cases or the indicator on any day.
To assess the performance of our methodology, we used two different approaches to evaluate recall and precision: a per time point and per rise point analysis. Both approaches are described below.
In a per time point evaluation approach, we quantified the performance of our methodology in comparison to two baseline guessers: a random guesser and a first derivative guesser. In this approach, each day counts as a prediction event.
A. Random Guesser
The random guesser marks each day as a rise prediction or not, randomly.
B. First Derivative Guesser
The first derivative guesser predicts that cases will rise if the indicator's first derivative is greater than 0: each day on which the indicator's first derivative is greater than 0 is marked as a rise prediction. The same is done for a first derivative guesser on cases (predicting that cases will rise if they are already rising).
C. Our Guesser
We designate our model’s rise predictions starting on the day of each indicator rise point, and for each day afterwards, for X number of days.
For our truth value, we mark a future rise starting on the day of each case rise point, and for each day before that, for X number of days.
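The prediction and truth marking can be sketched as follows (a Python illustration with an assumed X = 7; the day indices and function names are hypothetical, not our R implementation):

```python
def mark_predictions(indicator_rise_points, n_days, X=7):
    """Our guesser: flag the day of each indicator rise point
    and each of the X days after it."""
    pred = [False] * n_days
    for p in indicator_rise_points:
        for d in range(p, min(p + X + 1, n_days)):
            pred[d] = True
    return pred

def mark_truth(case_rise_points, n_days, X=7):
    """Truth: flag the day of each case rise point
    and each of the X days before it."""
    truth = [False] * n_days
    for p in case_rise_points:
        for d in range(max(p - X, 0), p + 1):
            truth[d] = True
    return truth

# Toy county: indicator rises on day 10, cases rise on day 15.
pred = mark_predictions([10], n_days=30)
truth = mark_truth([15], n_days=30)
```

With X = 7, days 10-17 carry rise predictions and days 8-15 carry the future-rise truth mark, so the two windows overlap on days 10-15.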
D. Evaluation
To calculate recall and precision for each guesser, for each county, every day that the guesser marks a rise prediction and the truth value shows a future rise, we count as a true positive. Each day that the guesser marks a rise prediction but the truth value does not have a future rise marked, we count as a false positive. Each day that the guesser did not mark a rise prediction, but the truth value does have a future rise marked, we count as a false negative.
By aggregating true positives, false positives, and false negatives across the entire time horizon, we calculate recall and precision as follows:
Recall is calculated by # true positives / (# true positives + # false negatives).
Precision is calculated by # true positives / (# true positives + # false positives).
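A sketch of this per-time-point scoring (Python; the toy marked windows are illustrative):

```python
def score(pred, truth):
    """Per-time-point evaluation: each day is a prediction event."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Toy marks: predictions cover days 2-5, the truth window covers days 4-7.
pred  = [d in range(2, 6) for d in range(10)]
truth = [d in range(4, 8) for d in range(10)]
recall, precision = score(pred, truth)
```

In this toy example the windows overlap on two of four marked days each, giving recall and precision of 0.5.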
E. Limitations
While this method lets us see how our model fares compared to a random or naive guesser, the recall and precision numbers do not necessarily reflect the models' real accuracy. The method compares our leading-indicator methodology with the other guessers under particular assumptions about how far the indicator generally leads cases: we designate a number X as the ideal leadingness, so an indicator rise that precedes the case rise by exactly X days scores better than one that precedes it by a day more or less. In reality, we want to count a window of differing lead lengths as successes; an indicator rise point preceding a case rise point by 7 days should ideally count as just as successful as one preceding it by 8 days, for example.
F. Note to Consider
While most other predictive models are evaluated on unchanging ground truth, the ground truth in this method can change based on the parameters selected.
We also evaluated our model’s performance using a per rise point approach. In contrast to the per time point approach where each day counts as an event in our model evaluation, the per rise point approach compares the recall and precision across the calculated rise points. This comparison, which allows for a flexible window of leadingness, likely enables a more meaningful interpretation of our model’s performance in the real world.
A. Methodology and Evaluation
We count all the indicator rise points and all the case rise points in all the counties. In each county, for each indicator rise point we add 1 to our true positive (or success) count if it is followed by a case rise point at least Y days and no more than Z days later. We sum the number of successes in each county.
Recall is calculated by # true positives / (# true positives + # false negatives) = # successes / # case rise points.
Precision is calculated by # true positives / (# true positives + # false positives) = # successes / # indicator rise points.
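A sketch of this per-rise-point scoring (Python, with illustrative Y and Z; it inherits the double-counting behavior discussed below):

```python
def rise_point_scores(indicator_points, case_points, Y=2, Z=14):
    """Count an indicator rise point as a success if some case rise
    point follows it by at least Y and at most Z days."""
    successes = sum(
        any(Y <= c - i <= Z for c in case_points) for i in indicator_points
    )
    recall = successes / len(case_points) if case_points else 0.0
    precision = successes / len(indicator_points) if indicator_points else 0.0
    return successes, recall, precision

# Toy county: indicator rises on days 10 and 40; cases rise on days 17 and 80.
successes, recall, precision = rise_point_scores([10, 40], [17, 80])
```

Here only the day-10 indicator rise is followed by a case rise within the 2-14 day window, so one success out of two rise points on each side gives recall and precision of 0.5.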
B. Limitations
One caveat on this method is that if there are multiple indicator rise points close together, those may be double counted when calculating the number of successes. For example, one indicator rise point could precede a case rise point by 7 days, and another indicator rise point, occurring 2 days after the first, could precede the same case rise point by 5 days. We would count 2 successes. Since the two consecutive indicator rise points would almost certainly not mark two real separate rises in the indicator, counting two successes in this case is a limitation of how we calculate the rise points.
Using this method of calculating recall and precision, we cannot easily compare our method with the same strawman guessers as the per time point approach. One reason for this difficulty is that the number of rise prediction points is drastically different between guessers, where strawman guessers have many positive predictions. As this methodology for calculating recall and precision also rewards these spurious predictions, there is no easy method of comparison between approaches.
In this section, we describe our pipeline for processing, plotting and analyzing the data using the methodology described above.
As an example, we’ll use our Dr Visits % CLI as our indicator. We use our LeadingIndicatorTools package for all our main functions.
drs_visits = get_and_parse_pre_read_signals(cases_df, doctors_df)
drs_visits_with_points = get_increase_points(drs_visits$cases, drs_visits$indicator)
plot_signals(get_subset_of_time("2020-05-01", "2020-11-30", drs_visits_with_points), "01003", smooth_and_show_increase_point=FALSE, "Drs Visits")
In the respective rise point columns, a day is marked with a 1 if it is found to be a rise point for that signal. We can see here that there is a rise point for Drs Visits on 5/22/2020 and for cases on 5/31/2020.
head(drs_visits_with_points[[1]],500)
We can see that Drs Visits begins to rise before cases rise. TODO I think we need to tweak our rise point method a bit so we don’t have these “double counting” points on a rise.
plot_signals(get_subset_of_time("2020-05-01", "2020-11-30", drs_visits_with_points), "01003", smooth_and_show_increase_point=TRUE, "Drs Visits")
We can plot some of the counties where rises in the doctor visits indicator consistently lead rises in cases.
We can also look at the distribution of the number of days by which Doctor Visits' rises lead case rises (when the case rise occurs between 2 and 14 days after the indicator rise).
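One way to compute such a lead-day distribution (a Python sketch with toy rise points; our pipeline computes this in R via LeadingIndicatorTools):

```python
from collections import Counter

def lead_day_distribution(indicator_points, case_points, Y=2, Z=14):
    """For each indicator rise point, record the gap to the first case
    rise point that follows it by Y to Z days (if any), then tally."""
    gaps = []
    for i in indicator_points:
        matches = [c - i for c in case_points if Y <= c - i <= Z]
        if matches:
            gaps.append(min(matches))
    return Counter(gaps)

# Toy data pooled over counties: lead gaps of 7, 7, and 5 days.
dist = lead_day_distribution([10, 30, 100], [17, 37, 105])
```

The resulting counter is what the distribution plots summarize: how often each lead length occurs.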
We can plot some of the counties where rises in one of the google symptoms signals (sum_anosmia_ageusia_smoothed_search) consistently lead rises in cases.
We can also look at the distribution of the number of days by which rises in the signal (sum_anosmia_ageusia_smoothed_search) lead case rises (when the case rise occurs between 2 and 14 days after the indicator rise).
We can plot some of the counties where rises in the quidel indicator (covid_ag_smoothed_pct_positive) consistently lead rises in cases.
We can also look at the distribution of the number of days by which Quidel rises lead case rises (when the case rise occurs between 2 and 14 days after the indicator rise).
We can plot some of the counties where rises in the FB CLI indicator (smoothed_whh_cmnty_cli) consistently lead rises in cases.
We can also look at the distribution of the number of days by which FB CLI rises lead case rises (when the case rise occurs between 2 and 14 days after the indicator rise).
We can plot some of the counties where rises in the FB public transit indicator (smoothed_wpublic_transit_1d) consistently lead rises in cases.
We can also look at the distribution of the number of days by which rises in the FB public transit signal (smoothed_wpublic_transit_1d) lead case rises (when the case rise occurs between 2 and 14 days after the indicator rise).
We can plot some of the counties where rises in the SafeGraph away from home indicator (full_time_work_prop_7dav) consistently lead rises in cases.
We can also look at the distribution of the number of days by which rises in the away from home signal (full_time_work_prop_7dav) lead case rises (when the case rise occurs between 2 and 14 days after the indicator rise).
We can plot some of the counties where rises in the SafeGraph bar visits indicator (bars_visit_prop) consistently lead rises in cases.
We can also look at the distribution of the number of days by which rises in the bar visits signal (bars_visit_prop) lead case rises (when the case rise occurs between 2 and 14 days after the indicator rise).
We can plot some of the counties where rises in the SafeGraph restaurants visits indicator (restaurants_visit_prop) consistently lead rises in cases.
We can also look at the distribution of the number of days by which rises in the restaurants visits signal (restaurants_visit_prop) lead case rises (when the case rise occurs between 2 and 14 days after the indicator rise).
On the example plots, there may be marked points towards the end of the plotted time period that look like they precede drops in cases or indicators; since the plots are split into subsets of the full time period, the subsequent rise is likely part of the next subset. Likewise, an indicator rise point's matching case rise point may fall outside the plotted period, or vice versa.
TODO Maybe include: an interpolation issue where certain days don't have case/signal data. This could be addressed by increasing "indicator_threshold" substantially, allowing very few days without indicator data.